Vector Space Models for Phrase-based Machine Translation

نویسندگان

  • Tamer Alkhouli
  • Andreas Guta
  • Hermann Ney
چکیده

This paper investigates the application of vector space models (VSMs) to the standard phrase-based machine translation pipeline. VSMs are models based on continuous word representations embedded in a vector space. We exploit word vectors to augment the phrase table with new inferred phrase pairs. This helps reduce out-of-vocabulary (OOV) words. In addition, we present a simple way to learn bilingually-constrained phrase vectors. The phrase vectors are then used to provide additional scoring of phrase pairs, which fits into the standard log-linear framework of phrase-based statistical machine translation. Both methods result in significant improvements over a competitive in-domain baseline applied to the Arabic-to-English task of IWSLT 2013.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Phrase Pair Rescoring With Term Weighting For Statistical Machine Translation

We propose to score phrase translation pairs for statistical machine translation using term weight based models. These models employ tf.idf to encode the weights of content and non-content words in phrase translation pairs. The translation probability is then modeled by similarity functions defined in a vector space. Two similarity functions are compared. Using these models in a statistical mac...

متن کامل

Phrase Pair Rescoring with Term Weighting for Statistical Machine Translatio

We propose to score phrase translation pairs for statistical machine translation using term weight based models. These models employ tf.idf to encode the weights of content and non-content words in phrase translation pairs. The translation probability is then modeled by similarity functions defined in a vector space. Two similarity functions are compared. Using these models in a statistical mac...

متن کامل

A new model for persian multi-part words edition based on statistical machine translation

Multi-part words in English language are hyphenated and hyphen is used to separate different parts. Persian language consists of multi-part words as well. Based on Persian morphology, half-space character is needed to separate parts of multi-part words where in many cases people incorrectly use space character instead of half-space character. This common incorrectly use of space leads to some s...

متن کامل

Rebuilding Phrase Table Scores from Monolingual Resources Using Neural Networks Vector Representations

In this paper, we propose two new features for estimating phrase-based machine translation parameters from mainly monolingual data. Our method is based on two recently introduced neural network vector representation models for words and sentences. It is the first time that these models have been used in an end to end phrase-based machine translation system. Scores obtained from our method can r...

متن کامل

Learning Bilingual Distributed Phrase Representations for Statistical Machine Translation

Following the idea of using distributed semantic representations to facilitate the computation of semantic similarity between translation equivalents, we propose a novel framework to learn bilingual distributed phrase representations for machine translation. We first induce vector representations for words in the source and target language respectively, in their own semantic space. These word v...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014